The project aims to analyze data concerning international students and rental properties in bayren, Germany. By examining this information, the project seeks to provide valuable insights to newcomers, helping them identify cities in bayren, Germany with the most favorable accommodation options and thus facilitating their transition to the country.
Through an examination of the trends in the influx of foreign students across various cities in Bavaria, Germany, and an assessment of the corresponding accommodation options in these cities, this analysis aims to offer valuable recommendations. These insights can be instrumental for prospective students, landlords, and local authorities in identifying cities with higher probabilities of finding suitable accommodation, thereby facilitating informed decision-making in the pursuit of education in Bavaria.
Datasource1: Foreign students in Germany
This dataset comprises information about international students admitted to degree programs in bayren Germany over a specific time period. It includes data on the students' nationalities, degree programs, admission dates, and other relevant attributes.
Datasource2: Immoscout24 dataset
This dataset comprises information about rental property listings from Immoscout24, which is the largest real estate platform in bayren Germany. It includes details about rental prices, property characteristics, and the property's location.
Import all necessary Libraries
# Import all necessary libraries
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd
import folium
import matplotlib.pyplot as plt
from IPython.display import display
# Display muninch_map
image = Image.open('output/project-flow.png')
display(image)
Immoscout dataset was sourced from Kaggle, by enabling Kaggle API, from kaggle "API" section and click on "Create New API Token". This will download a file named kaggle.json to your computer then Store Kaggle API Key in your system and the second dataset of Forigen students in Bavaria Germany was sourced from europa.eu.
#Data Extraction
def fetch_kaggle_dataset(dataset, target_folder, filename):
api = KaggleApi()
api.authenticate()
api.authenticate()
username, dataset_name = dataset.split('/')[-2:]
zip_file_path = os.path.join(target_folder,f"{dataset_name}.zip")
api.dataset_download_files(f"{username}/{dataset_name}", path=target_folder, unzip=False)
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extract(filename, path=target_folder)
def data_extraction_csv(path):
t1 = time()
print("Data Extraction in progress...")
try:
df = pd.read_csv(path, encoding='latin-1')
except Exception as e:
print("Error occurred during file reading:", str(e))
return None
t2 = time()
print("Finish: Data Extraction {} s ".format(t2 - t1))
return df
def data_extraction_xls(path):
t1 = time()
print("Data Extraction in progress...")
try:
df = pd.read_excel(path, engine='openpyxl')
except Exception as e:
print("Error occurred during file reading:", str(e))
return None
t2 = time()
print("Finish: Data Extraction {} s ".format(t2 - t1))
return df
Data Transformation
# Check if column is empty and drop corresponding rows
df1.dropna(subset=['regio1', 'regio2', 'noRooms'], inplace=True)
# Fetch only Bayren state dataset
df1 = df1[df1['regio1'] == 'Bayern']
# correct the formats of data values
df1['regio2'] = df1['regio2'].str.encode('latin-1').str.decode('utf-8')
# only fetch data with city name available in students data
university_cities = df2['City'].unique()
df1 = df1[df1['regio2'].isin(university_cities)]
# Format dataset
df1 = data_transformation(df1, df1_rename_cols, df1_drop_cols)
data_loader("../dataset.sqlite", df1, "immoscout")
# correct date format in year cloumn
df2['Jahr'] = pd.to_datetime(df2['Jahr'].str.split('/').str[0], format='%Y')
df2['Jahr'] = df2['Jahr'].dt.strftime('%Y')
#drop rows with num values
df2 = df2.dropna()
# List of columns to convert from float64 to int
columns = ['Geisteswissenschaften', 'Sozialwissenschaften', 'Mathematik',
'Ingenieurwissenschaften', 'Informatik',
'medizin', 'Landwirtschaft']
# Convert selected columns to int
df2[columns] = df2[columns].astype(int)
Rename Columns
df1_rename_cols = {
"regio1": "federalState",
"geo_plz": "zipCode",
"regio2": "City",
"regio3": "Town"
}
df2_rename_cols = {
"Jahr": "Year",
"Stadt": "City",
"Universität": "University",
"Geisteswissenschaften": "Humanities",
"Sozialwissenschaften": "Social sciences",
"Mathematik": "Mathematics",
"Ingenieurwissenschaften": "Engineering sciences",
"Informatik": "Computer Science",
"medizin": "Medicine",
"Landwirtschaft": "Agriculture"
}
Drop Columns
df1_drop_cols = ["picturecount", "scoutId", "geo_bln", "geo_krs", "telekomHybridUploadSpeed", "telekomTvOffer", "newlyConst", "balcony", "picturecount", "pricetrend", "telekomUploadSpeed", "scoutId", "firingTypes", "yearConstructedRange", "interiorQual", "petsAllowed", "streetPlain", "lift", "baseRentRange", "typeOfFlat", "energyEfficiencyClass", "lastRefurbish", "electricityBasePrice", "electricityKwhPrice", "date"]
# Query the database and load all data into a DataFrame
immoscout_table = "SELECT * FROM immoscout"
immoscout_table = pd.read_sql_query(immoscout_table, conn)
# Create a geocoder instance
geolocator = Nominatim(user_agent="my_app")
# Initialize lists to store latitude and longitude values
latitudes = []
longitudes = []
# Iterate over each row in the DataFrame
for index, row in immoscout_table.iterrows():
# Check if the city is Munich or Nuremberg
if row['City'] in ['München', 'Nürnberg']:
address = f"{row['houseNumber']} {row['street']}, {row['Town']}, {row['City']}, {row['zipCode']} , {row['federalState']}"
try:
location = geolocator.geocode(address, timeout=10)
if location is not None:
latitudes.append(location.latitude)
longitudes.append(location.longitude)
print(address, ":", location.latitude, location.longitude)
else:
latitudes.append(None)
longitudes.append(None)
except GeocoderTimedOut:
print(f"Geocoding timed out for address: {address}")
else:
# For other cities, use the existing latitude and longitude values
latitudes.append(None)
longitudes.append(None)
# Add latitude and longitude columns to the DataFrame
immoscout_table['latitude'] = latitudes
immoscout_table['longitude'] = longitudes
# Save the updated DataFrame back to the database
immoscout_table.to_sql('immoscout', conn, if_exists='replace', index=False)
In presenting the initial facets of our dataset, I have included the display of the dataframe head. This snapshot offers a brief yet insightful overview of the dataset's architecture, showcasing variable names, data types, and sample values. This intentional presentation serves as a precursor to more in-depth analyses and discussions, enabling a swift understanding of the dataset's composition before further exploration.
# Connect to the database
conn = sqlite3.connect('../dataset.sqlite')
# Query the database and load data into a DataFrames
immoscout_table = f"SELECT * FROM immoscout"
immoscout_df = pd.read_sql_query(immoscout_table, conn)
immoscout_df.head()
| federalState | serviceCharge | heatingType | totalRent | yearConstructed | noParkSpaces | hasKitchen | cellar | baseRent | houseNumber | ... | noRoomsRange | garden | livingSpaceRange | City | Town | description | facilities | heatingCosts | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bayern | 180.0 | central_heating | 2030.0 | 1871.0 | 0.0 | 1 | 1 | 1850.0 | 0 | ... | 2 | 0 | 4 | München | Altstadt | Die Wohnung (OHNE Balkon !!) liegt im 4. OG (O... | - Eichenparkettboden im Wohn-EZ u. im Schlafzi... | 0.0 | NaN | NaN |
| 1 | Bayern | 0.0 | 0 | 2400.0 | 1956.0 | 1.0 | 0 | 0 | 2400.0 | 3 | ... | 3 | 0 | 3 | München | Schwabing_West | Highlights:\n\n * alle Wände der Wohnräume, ... | Waschmaschine in der Wohnung\nFlachbildfernseh... | 0.0 | NaN | NaN |
| 2 | Bayern | 170.0 | floor_heating | 970.0 | 1974.0 | 1.0 | 1 | 1 | 800.0 | 0 | ... | 2 | 0 | 5 | Passau | Haidenhof_Süd | WOHNEN UND ENTSPANNEN!\n \nDie Architektur be... | - Gartenwohnung in bester Lage\n- 2 Zimmer + H... | 0.0 | NaN | NaN |
| 3 | Bayern | 0.0 | 0 | 2780.0 | 2016.0 | 0.0 | 0 | 0 | 2780.0 | 51 | ... | 3 | 0 | 4 | München | Milbertshofen | Bitte beachten Sie: Die Wohnung ist komplett m... | Waschmaschine in der Wohnung\nFlachbildfernseh... | 0.0 | 48.18808 | 11.54337 |
| 4 | Bayern | 200.0 | gas_heating | 0.0 | 0.0 | 0.0 | 1 | 0 | 900.0 | 0 | ... | 4 | 0 | 4 | Würzburg | Lengfeld | Helle 4,5-Zimmerwohnung in ruhigem, kleinen Wo... | 0 | 0.0 | NaN | NaN |
5 rows × 28 columns
# Connect to the database
conn = sqlite3.connect('../dataset.sqlite')
# Query the database and load forigen students data into a DataFrames
intstudents_table = f"SELECT * FROM intstudents"
intstudents_df = pd.read_sql_query(intstudents_table, conn)
intstudents_df.head()
| Region | University | City | Year | Humanities | Social sciences | Mathematics | Engineering sciences | Computer Science | Medicine | Agriculture | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bayren | Ostbayerische Technische Hochschule (OTH) Rege... | Regensburg | 1998 | 30 | 16 | 58 | 4 | 74 | 18 | 16 |
| 1 | bayren | Ostbayerische Technische Hochschule (OTH) Rege... | Regensburg | 1999 | 20 | 20 | 58 | 0 | 36 | 20 | 20 |
| 2 | bayren | Ostbayerische Technische Hochschule (OTH) Rege... | Regensburg | 2000 | 27 | 18 | 58 | 2 | 31 | 18 | 18 |
| 3 | bayren | Ostbayerische Technische Hochschule (OTH) Rege... | Regensburg | 2001 | 27 | 18 | 58 | 4 | 36 | 18 | 18 |
| 4 | bayren | Ostbayerische Technische Hochschule (OTH) Rege... | Regensburg | 2002 | 42 | 18 | 58 | 4 | 42 | 18 | 18 |
A bar plot displaying the distribution of the number of students according to the cities of Bavaria.
# List of columns to convert from float64 to int
columns = ['Humanities', 'Social sciences', 'Mathematics',
'Engineering sciences', 'Computer Science',
'Medicine', 'Agriculture']
# Calculate total students and add a new column
intstudents_df['Total-Student'] = intstudents_df[columns].sum(axis=1)
columns_to_use = ['City', 'Total-Student']
df_bar = intstudents_df[columns_to_use].groupby('City').sum()
# Sort the DataFrame by Total-Student in ascending order
df_bar = df_bar.sort_values(by='Total-Student', ascending=False)
# Set a seaborn color palette for different cities
city_palette = sns.color_palette("husl", n_colors=len(df_bar))
# Plot the compound bar graph using seaborn
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=df_bar.index, y='Total-Student', data=df_bar, palette=city_palette, hue=df_bar.index)
ax.set_xlabel('Cities')
ax.set_ylabel('No of Students')
ax.set_title('Distribution of Forigen Students in Cities of Bavaria')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Remove the legend, as it's not needed in this case
ax.legend().set_visible(False)
plt.show()
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Overall, the bar graph provides a useful overview of the distribution of students in Bavaria. It shows that the number of students is highest in the larger cities, and that there is a wide range in the number of students across different cities.
A trendline plot displaying the trend of foreign students studying in different cities of Bavaria over the last 20 years.
columns_to_use = ['City', 'Year', 'Total-Student']
df_trendline = intstudents_df[columns_to_use].copy()
df_trendline = df_trendline[df_trendline['Year'] > '1998']
# Group by 'City' and 'Year', and sum the 'Total-Student' values within each group
df_trendline_grouped = df_trendline.groupby(['City', 'Year'])['Total-Student'].sum().reset_index()
# Sort the DataFrame by 'Year'
df_trendline_grouped = df_trendline_grouped.sort_values(by='Year')
sns.set(style="darkgrid")
# Create a trendline plot
plt.figure(figsize=(12, 8))
ax = sns.lineplot(x='Year', y='Total-Student', hue='City', data=df_trendline_grouped, errorbar=None, marker='o', markersize=8)
plt.title('Trendline Plot of Forigen Students Over 20 Years')
plt.xlabel('Year')
plt.ylabel('Total Students')
plt.legend(title='City', loc='upper left', bbox_to_anchor=(1, 1))
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
plt.show()
Overall, the trendline plot suggests that Bavaria is becoming an increasingly popular destination for international students. This is likely due to a number of factors, including the high quality of education, the affordable cost of living, and the strong economy of the state.
A pie plot displaying the cities that contain a higher percentage of available apartments in different cities of Bavaria.
apartments_wrt_town = immoscout_df['City'].value_counts()
# Sort the values for better visualization
apartments_wrt_town_sorted = apartments_wrt_town.sort_values(ascending=False)
# Set a seaborn color palette for different cities
town_palette = sns.color_palette("tab10", n_colors=len(apartments_wrt_town_sorted))
# Plot the pie chart using seaborn
plt.figure(figsize=(8, 8))
plt.pie(apartments_wrt_town_sorted.values, labels=apartments_wrt_town_sorted.index, autopct='%1.1f%%', colors=town_palette)
plt.title('Percentage of apartments based on cities in Bavaria')
plt.axis('equal')
plt.show()
Overall, the pie chart provides a useful overview of the availability of apartments in Bavaria. It shows that there is a shortage of affordable housing in some parts of the state, and that this shortage is particularly acute in rural areas.
A bubble plot visualizing the locations of apartments available in Munich and Nuremberg based on longitude and latitude values calculated using the apartment addresses.
# Extract data for Munich and Nuremberg
munich_data = immoscout_df[immoscout_df['City'] == 'München'][['latitude', 'longitude']].dropna().values.tolist()
nuremberg_data = immoscout_df[immoscout_df['City'] == 'Nürnberg'][['latitude', 'longitude']].dropna().values.tolist()
# Create a Folium map
map_center = [48.1351, 11.5820] # Center the map around Munich
munich_map = folium.Map(location=map_center, zoom_start=11)
# Add Circle markers to the map
for location in munich_data:
folium.Circle(location=location, radius=200, color='blue', fill=True).add_to(munich_map)
# Display muninch_map
image = Image.open('output/munich_map.PNG')
display(image)
# Create a Folium map
map_center = [49.4521, 11.0767] # Center the map around Munich
nurnberg_map = folium.Map(location=map_center, zoom_start=11)
# Add Circle markers to the map
for location in nuremberg_data:
folium.Circle(location=location, radius=200, color='blue', fill=True).add_to(nurnberg_map)
# Display nurnberg_map
image = Image.open('output/nurnberg_map.PNG')
display(image)
Overall, the bubble plots provide a useful overview of the distribution of apartments in Munich and Nuremberg. They can be used to identify areas with a high concentration of apartments, to compare the price of apartments in different areas, and to understand the factors that influence apartment prices.
The filled bar graph illustrates the number of foreign students and available accommodations in Bavaria for the year 2022. The data is segmented by cities, providing insights into the accommodation availability for international students in different regions.
# Filter the DataFrame to include only rows where 'noRooms' is an integer
filtered_df = immoscout_df[immoscout_df['noRooms'].astype(int) == immoscout_df['noRooms']]
# Group by Cities and calculate the total number of rooms in each city
total_rooms_per_city = filtered_df.groupby('City')['noRooms'].sum().reset_index()
columns_to_use = ['City', 'Total-Student']
# Fetch student data for year 2022
intstudents_df = intstudents_df[intstudents_df['Year'] == '2022']
#groupby data according to Cities
df_bar = intstudents_df[columns_to_use].groupby('City').sum()
df_bar.reset_index(inplace=True)
df_bar.rename(columns={'City': 'City'}, inplace=True)
# Claulate percatage of number of room occupied and available in cities in 2022
df_bar['Total-Student-percentage'] = (df_bar['Total-Student']/total_rooms_per_city['noRooms'])*100
# Set the Seaborn style to 'whitegrid' for a light background
sns.set_style("whitegrid")
# Create a bar plot with visually attractive color scheme
plt.figure(figsize=(10, 6), facecolor='None')
# Define visually attractive colors
total_students_color = sns.color_palette("Blues")[4]
no_rooms_color = sns.color_palette("Oranges")[4]
ax1 = sns.barplot(x='City', y='noRooms', data=total_rooms_per_city, color=no_rooms_color, edgecolor='black', label='No Rooms')
ax2 = sns.barplot(x='City', y='Total-Student', data=df_bar, color=total_students_color, edgecolor='black', label='Total Students')
# Annotate each bar with values from the 'annotation_column'
for p, annotation_value in zip(ax1.patches, df_bar['Total-Student-percentage']):
ax1.annotate(f'{annotation_value:.0f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Set labels and title
plt.xlabel('City')
plt.ylabel('Student Count')
plt.title('Total Students and No Rooms by Cities')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Display the legend
plt.legend()
plt.ylim(0,11000)
# Show the plot
plt.show()
Overall, the graph shows that there is a high demand for student accommodation in Bavaria. Students are more likely to stay in dormitories than find other accommodation, and the districts with the most students also have the highest percentage of students with no rooms.
Upon comprehensive analysis of the dataset concerning foreign students and accommodation options in Bavaria, several key insights emerge. Munich stands out as the city with the highest number of students, reflecting its prominence as a major educational hub. However, when considering accommodation availability and affordability, other cities in Bavaria present noteworthy options for students.
Nuremberg and Augsburg, with substantial student populations and a diverse range of accommodation, emerge as promising choices for students seeking housing. These cities not only offer ample accommodation options but also showcase a growing economy, contributing to the overall attractiveness for students.
Furthermore, Passau, despite having a smaller number of students, stands out for its relatively lower percentage of students without accommodation (42%). This suggests that Passau provides better housing prospects for its student population compared to other districts.
In contrast, districts like Würzburg, Ansbach, and Erlangen demonstrate higher percentages of students without rooms, indicating potential challenges in securing suitable housing. While Würzburg records the highest percentage of students without rooms (94%), signaling a pronounced need for enhanced accommodation solutions in this district.
The pie chart highlighting the availability of apartments in Bavaria emphasizes the shortage of affordable housing in certain regions, particularly in rural areas such as Amberg, Rosenheim, and Coburg. Cities like Munich and Nuremberg, with higher percentages of available apartments, offer more promising prospects for students.
In summary, while Munich remains a significant academic and economic center, Nuremberg, Augsburg, and Passau emerge as cities with favorable accommodation options for students. As the demand for student housing continues to rise, these insights can guide students, landlords, and policymakers in making informed decisions regarding accommodation in Bavaria.